fbeta_score (Fβ score)#
The Fβ score measures the quality of a binary classifier by combining precision and recall into a single number.
It generalizes the F1 score by letting you choose how much more you care about recall vs precision.
Learning goals#
Define Fβ from the confusion matrix (math + intuition)
Implement fbeta_score from scratch in NumPy (with edge cases)
Visualize how β and the decision threshold change the score (Plotly)
Use Fβ to optimize a simple classifier (threshold tuning + a smooth surrogate)
Quick import (reference)#
from sklearn.metrics import fbeta_score
Prerequisites#
Binary classification with labels in {0, 1} (we treat 1 as the positive class)
Confusion matrix terms: TP, FP, FN, TN
Basic NumPy
This notebook focuses on binary Fβ. For multiclass/multilabel, most libraries compute Fβ via one-vs-rest + averaging (micro/macro/weighted).
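For reference, here is a minimal multiclass sketch using scikit-learn's average argument (the tiny label arrays are made up purely to show the call; the rest of this notebook sticks to binary Fβ):
# Multiclass Fβ: per-class one-vs-rest scores combined with average="macro"
from sklearn.metrics import fbeta_score
y_true_mc = [0, 1, 2, 2, 1, 0]
y_pred_mc = [0, 2, 2, 1, 1, 0]
print(fbeta_score(y_true_mc, y_pred_mc, beta=0.5, average="macro"))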
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
pio.templates.default = "plotly_white"
rng = np.random.default_rng(0)
np.set_printoptions(precision=4, suppress=True)
Confusion matrix (binary)#
Let:
true labels: (y \in {0, 1})
predicted labels: (\hat{y} \in {0, 1})
1 is the positive class
|  | (\hat{y}=1) | (\hat{y}=0) |
|---|---|---|
| (y=1) | TP | FN |
| (y=0) | FP | TN |
Important: Fβ does not use TN. That’s a feature (when you care mostly about the positive class), but also a limitation.
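A quick sanity check of that claim (the tiny arrays below are made up for illustration): appending correctly classified negatives changes TN but leaves Fβ untouched.
# True negatives do not enter Fβ: padding with 100 correctly classified
# negatives leaves the score unchanged.
from sklearn.metrics import fbeta_score
y_true_demo = np.array([1, 1, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1])
y_true_padded = np.r_[y_true_demo, np.zeros(100, dtype=int)]
y_pred_padded = np.r_[y_pred_demo, np.zeros(100, dtype=int)]
print(fbeta_score(y_true_demo, y_pred_demo, beta=2.0),
      fbeta_score(y_true_padded, y_pred_padded, beta=2.0))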
def sigmoid(z):
z = np.asarray(z, dtype=float)
z = np.clip(z, -50.0, 50.0)
return 1.0 / (1.0 + np.exp(-z))
def safe_divide(numer, denom, *, zero_division=0.0):
"""Elementwise numer/denom with a configurable value when denom == 0."""
numer = np.asarray(numer, dtype=float)
denom = np.asarray(denom, dtype=float)
out = np.full_like(numer + denom, fill_value=float(zero_division), dtype=float)
np.divide(numer, denom, out=out, where=denom != 0)
return out
def confusion_counts_binary(y_true, y_pred, *, pos_label=1):
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
if y_true.shape != y_pred.shape:
raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")
y_true_pos = y_true == pos_label
y_pred_pos = y_pred == pos_label
tp = int(np.sum(y_true_pos & y_pred_pos))
fp = int(np.sum(~y_true_pos & y_pred_pos))
fn = int(np.sum(y_true_pos & ~y_pred_pos))
tn = int(np.sum(~y_true_pos & ~y_pred_pos))
return tp, fp, fn, tn
def precision_recall_fbeta_from_counts(tp, fp, fn, *, beta=1.0, zero_division=0.0):
if beta <= 0:
raise ValueError("beta must be > 0")
beta2 = beta**2
precision = float(safe_divide(tp, tp + fp, zero_division=zero_division))
recall = float(safe_divide(tp, tp + fn, zero_division=zero_division))
fbeta = float(
safe_divide(
(1.0 + beta2) * tp,
(1.0 + beta2) * tp + beta2 * fn + fp,
zero_division=zero_division,
)
)
return precision, recall, fbeta
def precision_recall_fbeta(y_true, y_pred, *, beta=1.0, pos_label=1, zero_division=0.0):
tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred, pos_label=pos_label)
return precision_recall_fbeta_from_counts(tp, fp, fn, beta=beta, zero_division=zero_division)
def fbeta_score_numpy(y_true, y_pred, *, beta=1.0, pos_label=1, zero_division=0.0):
_, _, fbeta = precision_recall_fbeta(
y_true, y_pred, beta=beta, pos_label=pos_label, zero_division=zero_division
)
return fbeta
Precision, recall, and Fβ (math)#
Precision and recall are:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

The Fβ score is a weighted harmonic mean of (P) and (R):

$$F_\beta = (1 + \beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$$

A very useful confusion-matrix form is:

$$F_\beta = \frac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + \beta^2\,FN + FP}$$
How β changes the trade-off
(\beta = 1) gives F1 (precision and recall weighted equally)
(\beta > 1) favors recall (it upweights FN by (\beta^2))
(\beta < 1) favors precision
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0])
for beta in [0.5, 1.0, 2.0]:
p, r, f = precision_recall_fbeta(y_true, y_pred, beta=beta)
print(f"beta={beta:>3}: precision={p:.3f}, recall={r:.3f}, Fbeta={f:.3f}")
try:
from sklearn.metrics import fbeta_score as skl_fbeta_score
print("\nscikit-learn check:")
for beta in [0.5, 1.0, 2.0]:
print(f"beta={beta:>3}: sklearn={skl_fbeta_score(y_true, y_pred, beta=beta):.3f}")
except Exception as e:
print("\n(scikit-learn not available for comparison)")
print("Reason:", repr(e))
beta=0.5: precision=0.667, recall=0.667, Fbeta=0.667
beta=1.0: precision=0.667, recall=0.667, Fbeta=0.667
beta=2.0: precision=0.667, recall=0.667, Fbeta=0.667
scikit-learn check:
beta=0.5: sklearn=0.667
beta=1.0: sklearn=0.667
beta=2.0: sklearn=0.667
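In this example precision happens to equal recall, so all three Fβ values coincide. A slightly different (made-up) prediction vector, where the model misses more positives than it falsely flags, shows how β pulls the score toward recall or precision:
# precision != recall here (TP=2, FP=1, FN=2), so β matters:
# F0.5 leans toward precision, F2 toward recall.
y_true2 = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred2 = np.array([1, 1, 0, 0, 1, 0, 0, 0])
for beta in [0.5, 1.0, 2.0]:
    p, r, f = precision_recall_fbeta(y_true2, y_pred2, beta=beta)
    print(f"beta={beta:>3}: precision={p:.3f}, recall={r:.3f}, Fbeta={f:.3f}")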
Scores vs labels: the role of the decision threshold#
Many models output a score or probability (s(x) \in [0,1]), then convert it to a label using a threshold (t):

$$\hat{y} = \begin{cases} 1 & \text{if } s(x) \ge t \\ 0 & \text{otherwise} \end{cases}$$
Increasing (t) usually increases precision (fewer predicted positives) but decreases recall.
Since Fβ depends on TP/FP/FN, it depends on the choice of (t).
A very common workflow is:
Train a model with a differentiable loss (e.g., log loss)
Choose (t) on a validation set to maximize your target metric (e.g., F2)
def pr_fbeta_curve(y_true, y_score, *, beta=1.0, thresholds=None, zero_division=0.0):
y_true = np.asarray(y_true).astype(int)
y_score = np.asarray(y_score, dtype=float)
if thresholds is None:
thresholds = np.linspace(0.0, 1.0, 301)
thresholds = np.asarray(thresholds, dtype=float)
pred_pos = y_score[:, None] >= thresholds[None, :]
y_pos = (y_true == 1)[:, None]
tp = np.sum(pred_pos & y_pos, axis=0)
fp = np.sum(pred_pos & ~y_pos, axis=0)
fn = np.sum(~pred_pos & y_pos, axis=0)
precision = safe_divide(tp, tp + fp, zero_division=zero_division)
recall = safe_divide(tp, tp + fn, zero_division=zero_division)
beta2 = beta**2
fbeta = safe_divide(
(1.0 + beta2) * tp,
(1.0 + beta2) * tp + beta2 * fn + fp,
zero_division=zero_division,
)
return thresholds, precision, recall, fbeta, tp, fp, fn
def best_threshold_for_fbeta(y_true, y_score, *, beta=1.0, thresholds=None, zero_division=0.0):
thresholds, precision, recall, fbeta, tp, fp, fn = pr_fbeta_curve(
y_true, y_score, beta=beta, thresholds=thresholds, zero_division=zero_division
)
i = int(np.nanargmax(fbeta))
return {
"threshold": float(thresholds[i]),
"fbeta": float(fbeta[i]),
"precision": float(precision[i]),
"recall": float(recall[i]),
"tp": int(tp[i]),
"fp": int(fp[i]),
"fn": int(fn[i]),
"index": i,
"thresholds": thresholds,
"precision_curve": precision,
"recall_curve": recall,
"fbeta_curve": fbeta,
}
# Toy example: probability-like scores with overlap + class imbalance
n_pos, n_neg = 180, 820
y_true = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score = np.r_[rng.beta(5, 2, size=n_pos), rng.beta(2, 5, size=n_neg)]
perm = rng.permutation(len(y_true))
y_true, y_score = y_true[perm], y_score[perm]
thresholds = np.linspace(0.0, 1.0, 301)
_, precision, recall, _, _, _, _ = pr_fbeta_curve(y_true, y_score, beta=1.0, thresholds=thresholds)
best_05 = best_threshold_for_fbeta(y_true, y_score, beta=0.5, thresholds=thresholds)
best_1 = best_threshold_for_fbeta(y_true, y_score, beta=1.0, thresholds=thresholds)
best_2 = best_threshold_for_fbeta(y_true, y_score, beta=2.0, thresholds=thresholds)
best_05["threshold"], best_1["threshold"], best_2["threshold"]
(0.6266666666666667, 0.5333333333333333, 0.5)
fig = make_subplots(
rows=2,
cols=1,
shared_xaxes=True,
vertical_spacing=0.12,
subplot_titles=("Precision & recall vs threshold", "Fβ vs threshold"),
)
fig.add_trace(go.Scatter(x=thresholds, y=precision, mode="lines", name="precision"), row=1, col=1)
fig.add_trace(go.Scatter(x=thresholds, y=recall, mode="lines", name="recall"), row=1, col=1)
for beta, best in [(0.5, best_05), (1.0, best_1), (2.0, best_2)]:
fig.add_trace(
go.Scatter(
x=best["thresholds"],
y=best["fbeta_curve"],
mode="lines",
name=f"F{beta:g}",
),
row=2,
col=1,
)
fig.add_vline(
x=best["threshold"],
line_width=1,
line_dash="dot",
line_color="gray",
row="all",
col=1,
)
fig.update_xaxes(title_text="threshold t", row=2, col=1)
fig.update_yaxes(title_text="value", row=1, col=1)
fig.update_yaxes(title_text="Fβ", row=2, col=1)
fig.update_layout(height=700, legend_orientation="h")
fig.show()
Precision–recall curve + iso-Fβ lines#
As you sweep the threshold (t), you trace out a curve in (recall, precision) space.
For a fixed (\beta), you can also draw iso-Fβ curves. Points on higher iso-curves have higher Fβ.
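For a fixed target value (F), solving the Fβ formula for recall gives

$$R = \frac{F\,\beta^2\,P}{(1+\beta^2)\,P - F}, \qquad \text{valid when } (1+\beta^2)\,P > F,$$

which is exactly the denom / r expression used in the plotting code below.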
fig = go.Figure()
fig.add_trace(
go.Scatter(
x=recall,
y=precision,
mode="markers+lines",
marker=dict(
color=thresholds,
colorscale="Viridis",
showscale=True,
colorbar=dict(title="threshold"),
size=6,
),
name="PR (threshold sweep)",
)
)
# Iso-Fβ lines for beta=2
beta_iso = 2.0
beta2 = beta_iso**2
p_grid = np.linspace(1e-3, 1.0, 400)
for F in [0.2, 0.4, 0.6, 0.8]:
denom = (1.0 + beta2) * p_grid - F
r = (F * beta2 * p_grid) / denom
mask = (denom > 0) & (r >= 0) & (r <= 1)
fig.add_trace(
go.Scatter(
x=r[mask],
y=p_grid[mask],
mode="lines",
line=dict(width=1, dash="dot"),
name=f"iso-F{beta_iso:g}={F}",
opacity=0.8,
)
)
fig.update_layout(
title="Precision–Recall curve with iso-F2 lines",
xaxis_title="recall",
yaxis_title="precision",
height=600,
)
fig.update_xaxes(range=[0, 1])
fig.update_yaxes(range=[0, 1])
fig.show()
Using Fβ to optimize a simple classifier#
Two practical ways to “optimize for Fβ” are:
Train a model with a standard loss (e.g., log loss), then tune the threshold (t) to maximize Fβ on a validation set.
Optimize a smooth surrogate of Fβ (use probabilities instead of hard labels) with gradient-based methods.
We’ll do both with a from-scratch logistic regression.
def find_bias_for_target_rate(logits, target_rate, *, iters=60):
    """Bisection search for an intercept b so that mean(sigmoid(logits + b)) ≈ target_rate."""
    lo, hi = -20.0, 20.0
for _ in range(iters):
mid = (lo + hi) / 2.0
rate = sigmoid(logits + mid).mean()
if rate > target_rate:
hi = mid
else:
lo = mid
return (lo + hi) / 2.0
def make_synthetic_logistic_data(n=4000, *, target_pos_rate=0.15, seed=0):
rng_local = np.random.default_rng(seed)
X = rng_local.normal(size=(n, 2))
true_w = np.array([2.0, -1.2])
base_logits = X @ true_w
true_b = find_bias_for_target_rate(base_logits, target_pos_rate)
probs = sigmoid(base_logits + true_b)
y = rng_local.binomial(1, probs).astype(int)
return X, y, probs, (true_w, true_b)
def train_val_test_split(X, y, *, ratios=(0.6, 0.2, 0.2), seed=0):
if not np.isclose(sum(ratios), 1.0):
raise ValueError("ratios must sum to 1")
rng_local = np.random.default_rng(seed)
n = X.shape[0]
perm = rng_local.permutation(n)
X, y = X[perm], y[perm]
n_train = int(ratios[0] * n)
n_val = int(ratios[1] * n)
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train : n_train + n_val], y[n_train : n_train + n_val]
X_test, y_test = X[n_train + n_val :], y[n_train + n_val :]
return X_train, y_train, X_val, y_val, X_test, y_test
X, y, _, (true_w, true_b) = make_synthetic_logistic_data(n=4000, target_pos_rate=0.12, seed=1)
X_train, y_train, X_val, y_val, X_test, y_test = train_val_test_split(X, y, seed=1)
print("positive rate (train/val/test):", y_train.mean(), y_val.mean(), y_test.mean())
positive rate (train/val/test): 0.12 0.11625 0.115
def add_intercept(X):
X = np.asarray(X, dtype=float)
return np.c_[np.ones((X.shape[0], 1)), X]
def log_loss_and_grad(w, Xb, y, *, l2=0.0, eps=1e-12):
z = Xb @ w
p = sigmoid(z)
y = y.astype(float)
loss = -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
if l2:
loss += 0.5 * l2 * np.sum(w[1:] ** 2)
grad = (Xb.T @ (p - y)) / Xb.shape[0]
if l2:
grad[1:] += l2 * w[1:]
return loss, grad
def fit_logistic_regression_ce(X, y, *, lr=0.2, steps=1500, l2=0.0, seed=0):
rng_local = np.random.default_rng(seed)
Xb = add_intercept(X)
w = rng_local.normal(scale=0.01, size=Xb.shape[1])
history = []
for step in range(steps):
loss, grad = log_loss_and_grad(w, Xb, y, l2=l2)
w -= lr * grad
if step % 20 == 0 or step == steps - 1:
history.append((step, loss))
return w, np.array(history)
w_ce, hist_ce = fit_logistic_regression_ce(X_train, y_train, lr=0.3, steps=1200, l2=1e-3, seed=1)
hist_ce[:5], hist_ce[-5:]
(array([[ 0. , 0.6936],
[20. , 0.3378],
[40. , 0.2818],
[60. , 0.2603],
[80. , 0.249 ]]),
array([[1120. , 0.2243],
[1140. , 0.2243],
[1160. , 0.2243],
[1180. , 0.2243],
[1199. , 0.2243]]))
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_ce[:, 0], y=hist_ce[:, 1], mode="lines", name="log loss"))
fig.update_layout(title="Logistic regression training (cross-entropy)", xaxis_title="step", yaxis_title="log loss")
fig.show()
def predict_proba(w, X):
Xb = add_intercept(X)
return sigmoid(Xb @ w)
def evaluate_thresholded(y_true, y_score, *, threshold, beta=1.0):
y_pred = (y_score >= threshold).astype(int)
tp, fp, fn, tn = confusion_counts_binary(y_true, y_pred)
precision, recall, fbeta = precision_recall_fbeta_from_counts(tp, fp, fn, beta=beta)
return {
"threshold": float(threshold),
"beta": float(beta),
"precision": float(precision),
"recall": float(recall),
"fbeta": float(fbeta),
"tp": int(tp),
"fp": int(fp),
"fn": int(fn),
"tn": int(tn),
}
val_scores_ce = predict_proba(w_ce, X_val)
test_scores_ce = predict_proba(w_ce, X_test)
betas = [0.5, 1.0, 2.0]
rows = []
for beta in betas:
best = best_threshold_for_fbeta(y_val, val_scores_ce, beta=beta, thresholds=np.linspace(0, 1, 501))
test_eval = evaluate_thresholded(y_test, test_scores_ce, threshold=best["threshold"], beta=beta)
rows.append({
"beta": beta,
"best_val_threshold": best["threshold"],
"val_fbeta": best["fbeta"],
"test_precision": test_eval["precision"],
"test_recall": test_eval["recall"],
"test_fbeta": test_eval["fbeta"],
})
try:
    import pandas as pd
    display(pd.DataFrame(rows))  # display() is available when running under IPython/Jupyter
except Exception:
    print(rows)
Notice how the optimal threshold changes with β.
With (\beta > 1), you typically pick a lower threshold to increase recall.
With (\beta < 1), you often pick a higher threshold to increase precision.
thresholds = np.linspace(0.0, 1.0, 501)
fig = go.Figure()
for beta in [0.5, 1.0, 2.0]:
best = best_threshold_for_fbeta(y_val, val_scores_ce, beta=beta, thresholds=thresholds)
fig.add_trace(go.Scatter(x=thresholds, y=best["fbeta_curve"], mode="lines", name=f"val F{beta:g}"))
fig.add_trace(
go.Scatter(
x=[best["threshold"]],
y=[best["fbeta"]],
mode="markers",
marker=dict(size=10),
name=f"best t for F{beta:g}",
)
)
fig.update_layout(
title="Validation: Fβ vs threshold (same model, different β)",
xaxis_title="threshold",
yaxis_title="Fβ",
height=500,
)
fig.show()
Direct optimization (optional): a differentiable “soft Fβ” surrogate#
Hard Fβ uses thresholded predictions (\hat{y} \in {0,1}), so it’s not differentiable in the model parameters.
A common trick is to replace (\hat{y}) with the model probability (p\in[0,1]) and define “soft” counts:

$$\widetilde{TP} = \sum_i y_i\, p_i, \qquad \widetilde{FP} = \sum_i (1 - y_i)\, p_i, \qquad \widetilde{FN} = \sum_i y_i\, (1 - p_i)$$

Then plug them into the same formula:

$$\widetilde{F}_\beta = \frac{(1 + \beta^2)\,\widetilde{TP}}{(1 + \beta^2)\,\widetilde{TP} + \beta^2\,\widetilde{FN} + \widetilde{FP}}$$
This surrogate is smooth in (p), so we can do gradient ascent on a logistic regression model.
Caveat: optimizing (\widetilde{F}_\beta) is not identical to optimizing the hard-thresholded Fβ, but it can be a useful demonstration (and sometimes a practical heuristic).
def soft_fbeta_and_grad(w, Xb, y, *, beta=2.0, l2=0.0, eps=1e-12):
"""Return (soft_fbeta, grad_w) for a logistic regression model p = sigmoid(Xb @ w)."""
if beta <= 0:
raise ValueError("beta must be > 0")
beta2 = beta**2
z = Xb @ w
p = sigmoid(z)
y = y.astype(float)
tp = np.sum(y * p)
sp = np.sum(p) # tp + fp
pos = np.sum(y)
denom = sp + beta2 * pos + eps
f = (1.0 + beta2) * tp / denom
# dF/dp_i
dF_dp = (1.0 + beta2) * (y * denom - tp) / (denom**2)
dF_dz = dF_dp * p * (1.0 - p)
grad = Xb.T @ dF_dz
if l2:
f -= 0.5 * l2 * np.sum(w[1:] ** 2)
grad[1:] -= l2 * w[1:]
return float(f), grad
def fit_logistic_regression_soft_fbeta(X, y, *, beta=2.0, lr=1e-3, steps=4000, l2=1e-3, seed=0):
rng_local = np.random.default_rng(seed)
Xb = add_intercept(X)
w = rng_local.normal(scale=0.01, size=Xb.shape[1])
history = []
for step in range(steps):
f, grad = soft_fbeta_and_grad(w, Xb, y, beta=beta, l2=l2)
w += lr * grad
if step % 50 == 0 or step == steps - 1:
history.append((step, f))
return w, np.array(history)
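Before running the ascent loop, a quick finite-difference check (a throwaway sanity test, not part of the pipeline above) confirms the analytic gradient returned by soft_fbeta_and_grad:
# Finite-difference check of the analytic soft-Fβ gradient on a tiny random problem.
Xb_chk = add_intercept(rng.normal(size=(50, 2)))
y_chk = rng.integers(0, 2, size=50)
w_chk = rng.normal(scale=0.1, size=Xb_chk.shape[1])
_, g_analytic = soft_fbeta_and_grad(w_chk, Xb_chk, y_chk, beta=2.0)
eps_fd = 1e-6
g_numeric = np.zeros_like(w_chk)
for j in range(len(w_chk)):
    w_plus, w_minus = w_chk.copy(), w_chk.copy()
    w_plus[j] += eps_fd
    w_minus[j] -= eps_fd
    g_numeric[j] = (soft_fbeta_and_grad(w_plus, Xb_chk, y_chk, beta=2.0)[0]
                    - soft_fbeta_and_grad(w_minus, Xb_chk, y_chk, beta=2.0)[0]) / (2 * eps_fd)
print("max |analytic - numeric|:", np.max(np.abs(g_analytic - g_numeric)))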
beta_opt = 2.0
w_soft, hist_soft = fit_logistic_regression_soft_fbeta(
X_train, y_train, beta=beta_opt, lr=2e-3, steps=6000, l2=1e-3, seed=2
)
hist_soft[:5], hist_soft[-5:]
(array([[ 0. , 0.3058],
[ 50. , 0.3105],
[100. , 0.3151],
[150. , 0.3197],
[200. , 0.3242]]),
array([[5800. , 0.5 ],
[5850. , 0.5005],
[5900. , 0.501 ],
[5950. , 0.5015],
[5999. , 0.502 ]]))
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_soft[:, 0], y=hist_soft[:, 1], mode="lines", name=f"soft F{beta_opt:g}"))
fig.update_layout(
title=f"Logistic regression training (maximize soft F{beta_opt:g})",
xaxis_title="step",
yaxis_title=f"soft F{beta_opt:g}",
)
fig.show()
val_scores_soft = predict_proba(w_soft, X_val)
test_scores_soft = predict_proba(w_soft, X_test)
best_ce = best_threshold_for_fbeta(y_val, val_scores_ce, beta=beta_opt, thresholds=np.linspace(0, 1, 501))
best_soft = best_threshold_for_fbeta(y_val, val_scores_soft, beta=beta_opt, thresholds=np.linspace(0, 1, 501))
test_ce = evaluate_thresholded(y_test, test_scores_ce, threshold=best_ce["threshold"], beta=beta_opt)
test_soft = evaluate_thresholded(y_test, test_scores_soft, threshold=best_soft["threshold"], beta=beta_opt)
rows = [
{
"model": "cross-entropy + threshold tuning",
"val_best_threshold": best_ce["threshold"],
"val_Fbeta": best_ce["fbeta"],
"test_precision": test_ce["precision"],
"test_recall": test_ce["recall"],
"test_Fbeta": test_ce["fbeta"],
},
{
"model": f"maximize soft F{beta_opt:g} + threshold tuning",
"val_best_threshold": best_soft["threshold"],
"val_Fbeta": best_soft["fbeta"],
"test_precision": test_soft["precision"],
"test_recall": test_soft["recall"],
"test_Fbeta": test_soft["fbeta"],
},
]
try:
    import pandas as pd
    display(pd.DataFrame(rows))  # display() is available when running under IPython/Jupyter
except Exception:
    print(rows)
Pros / cons and when to use Fβ#
Pros#
Focuses on the positive class (TP/FP/FN) — useful for imbalanced problems where TN is less informative
Adjustable trade-off via (\beta): pick recall-heavy ((\beta>1)) or precision-heavy ((\beta<1))
Single number summarizing the precision–recall trade-off (easy to compare models)
Cons#
Threshold-dependent: you must choose a threshold (or a policy) to get a meaningful number
Not a proper scoring rule: it does not reward well-calibrated probabilities the way log loss / Brier score do
Ignores TN: can be misleading when TN matters (e.g., overall error rate is critical)
Not smooth in the hard form (can’t be directly optimized with gradient descent without surrogates)
When it’s a good fit#
Information retrieval / search / recommendation (relevant items are “positive”)
Medical screening or safety monitoring (often (\beta>1) to favor recall)
Fraud / abuse detection (often (\beta<1) if false positives are expensive)
Pitfalls + diagnostics#
Pick (\beta) based on real costs (FN vs FP), not after looking at the test set.
Always tune thresholds on a validation set (or use cross-validation).
For highly imbalanced data, look at the full precision–recall curve; a single Fβ can hide failure modes.
Be explicit about the positive class (pos_label).
Decide how to handle zero-division cases (no predicted positives, or no actual positives); see the sketch after this list.
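As a small illustration of that last point, a sketch using the helpers defined earlier (the degenerate arrays are made up for illustration):
# (a) no predicted positives: precision is 0/0, but Fβ is still well defined (and 0)
# (b) no positives at all, actual or predicted: Fβ itself is 0/0, so zero_division decides
y_true_a, y_pred_a = np.array([1, 1, 0, 0, 0]), np.zeros(5, dtype=int)
y_true_b, y_pred_b = np.zeros(5, dtype=int), np.zeros(5, dtype=int)
for zd in [0.0, 1.0]:
    p, r, f = precision_recall_fbeta(y_true_a, y_pred_a, beta=2.0, zero_division=zd)
    print(f"(a) zero_division={zd}: precision={p}, recall={r}, F2={f}")
    print(f"(b) zero_division={zd}: F2={fbeta_score_numpy(y_true_b, y_pred_b, beta=2.0, zero_division=zd)}")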
Exercises#
Implement macro-averaged Fβ for multiclass classification via one-vs-rest.
For a fixed model, show how the best threshold changes when (\beta\in{0.25,0.5,1,2,4}).
Compare optimizing (a) log loss + threshold tuning vs (b) a soft-Fβ surrogate on a more extreme imbalance (e.g., 1% positives).
References#
scikit-learn fbeta_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fbeta_score.html
scikit-learn metrics overview: https://scikit-learn.org/stable/modules/model_evaluation.html
C. J. van Rijsbergen, Information Retrieval (discussion of the (F_\beta) measure)